Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures

نویسندگان

  • Emmanuel Jeannot
  • Guillaume Mercier
چکیده

Abstract. MPI process placement can play a deterministic role concerning the application performance. This is especially true with nowadays architecture (heterogenous, multicore with different level of caches, etc.). In this paper, we will describe a novel algorithm called TreeMatch that maps processes to resources in order to reduce the communication cost of the whole application. We have implemented this algorithm and will discuss its performance using simulation and on the NAS benchmarks.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Effect of Multi-core on HPC Applications in Virtualized Systems

In this paper, we evaluate the overheads of virtualization in commercial multicore architectures with shared memory and MPI-based applications. We find that the non-uniformity of memory latencies affects the performance of virtualized systems significantly. Due to the lack of support for non-uniform memory access (NUMA) in the Xen hypervisor, shared memory applications suffer from a significant...

متن کامل

Mouvement de données et placement des tâches pour les communications haute performance sur machines hiérarchiques

The emergence of multicore processors led to an increasing complexity inside the modern servers, with many cores, distributed memory banks and multiple Input/Output buses. The execution time of parallel applications depends on the efficiency of the communications between computing tasks. On recent architectures, the communication cost is largely impacted by hardware characteristics such as NUMA...

متن کامل

Design of Scalable PGAS Collectives for NUMA and Manycore Systems

The increasing number of cores per processor is turning multicore-based systems in pervasive. This involves dealing with multiple levels of memory in NUMA systems, accessible via complex interconnects in order to dispatch the increasing amount of data required. The key for efficient and scalable provision of data is the use of collective communication operations that minimize the impact of bott...

متن کامل

MPC: A Unified Parallel Runtime for Clusters of NUMA Machines

Over the last decade, Message Passing Interface (MPI) has become a very successful parallel programming environment for distributed memory architectures such as clusters. However, the architecture of cluster node is currently evolving from small symmetric shared memory multiprocessors towards massively multicore, Non-Uniform Memory Access (NUMA) hardware. Although regular MPI implementations ar...

متن کامل

Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs

Even with advances in materials science, fundamental limits in heat and power distribution are preventing higher CPU clock frequencies. Industry solutions for increasing computation speeds have concentrated on raising the number of computational cores available, leading to the wide-spread adoption of so-called “fat” nodes. However, keeping all the computation cores busy doing useful work is a c...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010